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Abstract 

Respondent-driven sampling is a form of link-tracing network sampling, 
which is widely used to study hard-to-reach populations, often to estimate 
population proportions. Previous treatments of this process have used a with- 
replacement approximation, which we show induces bias in estimates for large 
sample fractions and differential network connectedness by characteristic of 
interest. 

We present a treatment of respondent-driven sampling as a successive sam- 
pling process. Unlike existing representations, our approach respects the es- 
sential without-replacement feature of the process, while converging to an ex- 
isting with-replacement representation for small sample fractions, and to the 
sample mean for a full-population sample. 

We present a successive-sampling based estimator for population means 
based on respondent-driven sampling data, and demonstrate its superior per- 
formance when the size of the hidden population is known. We present sen- 
sitivity analyses for unknown population sizes. In addition, we note that like 
other existing estimators, our new estimator is subject to bias induced by the 
selection of the initial sample. Using data collected among three populations 
in two countries, we illustrate the application of this approach to populations 
with varying characteristics. We conclude that the successive sampling esti- 
mator improves on existing estimators, and can also be used as a diagnostic 
tool when population size is not known. 
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1 Background 



Researchers and government officials are often interested in characteristics of hu- 
man populations for which there are no practicable sampling frames for direct 
sampling. In some such hidden populations, members are connected through so- 
cial networks. A common approach is to collect a sample using a variant of link- 
tracing ( Thompson! 2006a|b[ ), such as a snowball sample (Goodman, 1961[ ), where 
subsequent sample members are selected based on their relations with previously 
sampled individuals. When the initial sample is not a probability sample, this 
approach does not result in a probability sample. However, most available alter- 
natives, ( |Muhib et ar||200T|[Piterson et al.H2008HWatters and Biemacki[[T989| also 
fail to produce a probability sample of the population. 

Respondent-Driven Sampling (RDS, introduced by Heckathorn 1997, 2002, see 
also Salganik and Heckathorn, 2004, Volz and Heckathorn 2008) is a recently intro- 
duced variant of link-tracing sampling which increases the ease of sampling and 
produces samples which, it is argued, approach probability samples as sampling 
progresses. The lack of satisfactory alternatives has spurned a strong demand for 
RDS ( [Johnston et"am2008||Lansky et al.[[2007) |. 

RDS presents two main innovations: a sampling design and a corresponding 



approach to estimation. The former is highly effective in many settings (Lansky 
et al.[ 2007[ ); the respondent-driven design relies on the respondents at each wave to 
select the next wave by distributing uniquely identified coupons to others in the 
target population, who can choose to return the coupons to enroll in the study. 
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Thus, the sampling exploits the network of social relations while also avoiding the 
confidentiality concerns associated with recording the names of contacts. 

The key innovation for estimation is that through many waves of sampling, 
the dependence of the final sample on the initial sample is reduced, allowing re- 
searchers more confidence in making approximate probability statements about 
the resulting samples. This insight allows for statistical inference in settings where 
the initial sample is typically selected by a convenience mechanism. Although cur- 
rent inference is likely superior to the alternative non-probability methods, existing 



methods are sensitive to deviations from many assumptions (Gile and Handcock 



2010 Goel and Salganik 2009| Neely[ 2009[ ). This paper offers a modification of the 



existing theoretical formulation of respondent-driven sampling, and correspond- 
ing inference to address what we find to be a serious conceptual weakness of ex- 
isting work: the known inaccuracy of the with-replacement approximation to the 
sampling process. 

In the next section, we begin by introducing current RDS estimation, particu- 



larly the estimator introduced by Volz and Heckathom (2008), and illustrate the 
sensitivity of this estimator to the with-replacement sampling assumption in cases 
of substantial sample fractions. In Sections |3] and |4| we then introduce a new model 
for RDS sampling based on successive sampling, and introduce a new estimator 
based on that model. In Section |5| we use a simulation study to illustrate the su- 
perior performance of the new estimator. We also include sensitivity analyses con- 
cerning inaccurate estimation of the size of the target population and the charac- 
teristics of the initial sample. In Section[6| we apply our estimator to data collected 
in three populations of drug users and men who have sex with men. We conclude 
with a broader discussion of the method and its limitations. 
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2 Previous Approaches to Estimation 

The basic ideas underlying estimation from RDS data are clever and important. 
They allow for something like valid statistical inference, in a sampling setting 
where the target population cannot be effectively reached using a traditional sam- 
pling frame. The original article, ( Heckathorn| |1997| ) made very strong assump- 



tions about the sampling procedure so as to assume that the sample proportions 
were representative of the population proportions. Salganik and Heckathorn ( 2004[ ) 



introduced a Markov chain argument for population mixing, and proposed an 
estimator based on equating the number of cross-relations between pairs of sub- 
populations of interest, based on the referral patterns of each group. This esti- 
mator is currently in wide use, and is implemented in the standard RDS analysis 
software ( |Volz et al.[|2007D . |Volz and Heckathorn] ( |2008| ) cormect RDS estimation to 



mainstream survey sampling through the use of a generalized Horvitz-Thompson 
estimator form. This estimator relies on the estimation of the inclusion probabili- 
ties of the sampled units, ttj. Based on an argument for treating the sample as inde- 
pendent draws from the stationary distribution of a random walk on the nodes of 
an undirected graph, Volz and Heckathorn ( 2008[ ) approximate the sampling prob- 



abilities as proportional to nodal degree, di, or number of incident edges in the 
graph. This estimator avoids the problem of potentially unknown population size 
N by using the generalized Horvitz-Thompson form, normalizing by an estimator 
of as follows: 
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where Sj = 1 indicates that the i*^ unit has been selected for sampling, and Sj = 
indicates it has not been selected, and Zj represents the variable of interest mea- 
sured on the i^^ unit. 



We refer to this approach as the Volz-Heckathorn (VH) estimator. Gile and 



Handcock ( 2010| ) illustrate that this estimator consistently out-performs the Salganik- 



Heckathorn estimator, and we believe it is the most principled estimator currently 
available for RDS. For these reasons, we use the VH estimator as the standard for 
comparison in this paper. 



Many critical assumptions required to justify this estimator are explored in Gile 



and Handcock (2010) and listed in Table |2j In this paper, we focus on the reliance 



on a with-replacement sampling model. We present an estimator based on an alter- 
native sampling model reflecting the without-replacement nature of the sampling 
process. 



3 Successive Sampling for RDS 

The Volz-Heckathorn estimator requires many waves of sampling to justify its re- 
liance on a stationary distribution. In practice the number of waves is small (al- 
most always fewer than 20, and often 5 or fewer). Also, we wish to consider a 
without-replacement process for which stationarity does not apply. It is therefore 
instructive to consider a special case when stationarity is not necessary. 

Consider a graph formed in the following manner: Begin with vertices, des- 
ignated by indices 1 : A^. Assign to each vertex a number of edge-ends according 
to arbitrary fixed degree distribution N = Ni, N2, . . . Nx/ where Nj is the number of 
nodes of degree j, 1 and K are the minimum and maximum degrees, respectively, 
and subject to the constraint that twice the number of edges, 2E = Yl,k&i k ^^k, is 
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even. Now select pairs of edge-ends completely at random and assign an edge con- 
necting each pair. This procedure results in a variant of the so-called configuration 
model for networks, a popular null model for networks, especially in the physics 
literature ( [MoUoy and Reedj |1995| ). Note that this formulation does allow loops, 
or links to oneself as well as multiple edges between the same pair of vertices, al- 
though the rate of these events decreases with increasing population size for fixed 
maximum degree K, such that several authors have suggested they are negligible 



for K < (dA^)^/^ or i^T < N^/"^, where d is the mean degree of the network (Boguna 



et al. 


2004 ; Burda and Krzywicki[ 2003 


Catanzaro et al. 


2005 


Chung and Lu 


2002 



Foster et al. 2007) 



Now consider a random walk G* on a set of vertices with degrees given by 
di, d2, . . . , d^v, such that I(dj = k) = Njt, where 1(^4) is the indicator function on 
A. Volz and Heckathorn ( 2008[ ) consider the stationary distribution of this random 
walk for a fixed graph. Instead, we consider the transition probabilities of the cor- 
responding walk over the distribution of all networks of fixed degree distribution 
constructed as above, in which the j^^ node visited, G*, is selected from the dis- 
tribution of possible edges from node G*_j^. The transition probabilities are then 
given by: 



P(G* = g;|Gt,G;,...G* 



'<j-l) 



2E-1 



d„.-l 
2E-1 



(2) 



where this probability is taken over the space of all possible configuration model 
graphs of given degree distribution, as well as over the steps of the random walk. 
This procedure results in stationary distribution proportional to dj, and, in fact. 
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selection probabilities at each step very nearly proportional to dj. Thus, for a net- 
work structure given by such a configuration model, the Volz-Heckathorn estima- 
Hansen and Hurwitz ( 1943[ ) estimator, without further require- 



tor constitutes a 



ment of sufficient waves for convergence. 

Now consider the corresponding self-avoiding random walk G. In this proce- 
dure, 

G gi...gj_i. 

This sampling procedure is mathematically equivalent to successive sampling (SS) or 
probability proportional to size without replacement sampling (PPSWOR), dating back to 
Yates and Grundy ( 1953[ ), and typically defined by the following sampling process: 



Begin with a population of units, denoted by indices 1 . . . with varying 
sizes represented by di, d2, . . . d^v, with J2iLi dj = SE, for total edges E. 

Sample the first unit Gi from the full population {1 . . . A^} with probability 
proportional to size dj. 

Select each subsequent unit with probability proportional to size from among 
the remaining units, such that conditional sampling probabilities are given by 



In the survey sampling literature, mostly based on the work of Raj ( 1956[ ) and 



Murthy ( |1957[ ), this sampling design is referred to as probability proportional to size 
without replacement (PPSWOR), and typically used in instances where the desired 
probability proportional to size (PPS) design is not feasible. In such cases, the sizes 
of population units are all known, the sampling design is implemented, and the 



main interest is in estimating the population total or mean of a variable measured 
on sampled units. The analytical properties of this procedure are quite difficult, as 
suggested by more recent work including Rao et alT| ( 1991[ ) and Kochar and Korwar 



(2001). In fact, even the marginal unit sampling probabilities are not available in 
closed form. 

In the geological discovery literature, successive sampling is not a purposive 
design, but an approximation to a non-designed sampling process. Important 
work in this area includes Andreatta and Kaufman] ( 1986[ ), [Nair and Wang ( 1989[ ), 
and Bickel et al. ( 1992[ ). These authors address the case of oil field discovery, in 
which successive fields are discovered with probabilities proportional to some 
measure of their size, typically taken to be their volume. This literature does not 
assume the full population of sizes to be known, and typically takes some function 
of the sizes, such as the sum of the sizes of undiscovered reserves, as the object 
of inference. Our application of successive sampling also assumes the unobserved 
sizes to be unknown, however as in Raj ( 1956[ ) and Murthy ( 1957[ ), our object is 
to use these sizes to estimate the population characteristics on another variable 
measured on all sampled units. To do so, we both develop a novel algorithm and 
leverage more recent work by |Fattorini ( 2006[ ). 



4 Estimation of population means from RDS samples 
based on successive sampling 

Under successive sampling, for population of sizes given by N and sample size n, 
there is a function /tt (fc; n, N) mapping the size A; of a unit to its sampling probabil- 
ity TTj. Our proposed estimator is based on estimated sampling weights, which are 
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based on the tt given by the successive sampling procedure, applied to nodal sam- 
pling units and "sizes" given by nodal degrees. There are three key challenges for 
this approach. The mapping depends on first, the known population size and, 
second, the degree distribution, N, neither of which is known in the general RDS 
case. Finally, given M and the sample size n, the mapping is not explicit. The lack of 
explicit mapping has been addressed by Fattorini ( 2006[ ), who suggests estimating 



the mapping by simulation. For known N and given n, he simulates the successive 
sampling procedure, then estimates the inclusion probability tTj associated with 
unit i by: 

where Uj is the number of times unit i is sampled in the M trials. He proposes 
using these estimated probabilities in the standard Horvitz-Thompson estimator: 



j:S,=l ^ 

In most of this paper, we assume the population size, N, is known. We evaluate 



the sensitivity of our results to that assumption in Section 5.2 and in the examples 
in Section m 

Given the known population size, we present a novel approach to estimating 
the degree distribution N jointly with inclusion probabilities when degrees are only 
observed for sampled nodes. This procedure can be applied beyond the RDS set- 
ting whenever the population distribution of unit sizes corresponding to a sample 
collected through successive sampling is unknown. 

Our approach iteratively estimates the population distribution N and the map- 
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ping /tt (A;; n,N) : k ^ tt. This approach relies on two key points: 

• For known population of degrees N, the mapping /7r(/c;n, N) can be esti- 
mated by simulation in a form similar to Q. 

• For known mapping /tt (A;; n, N), the number (or proportion) of the popula- 
tion of degree k can be estimated using a form similar to (jSj). 

We leverage these two points to propose the following procedure for the esti- 
mation of population mean fi in the case of nodal degrees observed only for sam- 
pled units. 

Let E [■; n, N] denote expectation with respect to a sample of size n sampled by 
successive sampling from a population with degree counts N = {Ni, N2, . . . , Na'}. 
Then: 

E[V,;n,N]=Nkf7z{k;n,N) k = l,...,K (6) 

where is the random variable representing the number of sample units with 
degree k, /tt (A;; N) = E [Sj : = k; n, N] is the (common) inclusion probability 
of a node of degree k, and K is the maximum degree, K < N. This suggests first 
order moment equations for the unknown true N : 

E[Vfc;n,N]=Vfc k = l,...,K (7) 

where is the observed number of sample units with degree k. 
The algorithm is then: 
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1. Initial estimate: 



Mk;nX) = ^f:'f, (8) 
1=1 



that is /7r(/c; n, N°) proportional to k. 
2. For i — 1,. . . ,r iterate the following steps: 

(a) Estimate the population distribution of degrees: 



Vfc 



= TV . ^ri^i^^^^ k^l,...,K (9) 

where is the observed number of sample units with degree k. 
(b) Estimate inclusion probabilities: 

i. Simulate M successive sampling samples of size n from a popula- 
tion with composition N*. 

ii. Estimate the inclusion probabilities: 

N.) = ^%^._il^, (10) 

where is the total number of observed units of size k in the M simula- 
tions. 

3. Estimate the population distribution of degrees and the corresponding inclu- 
sion probabilities via, respectively: N = N'' and 7r(-) — fTr{-;n,W). 

4. Use the resulting mapping tt to estimate /j, via the generalized Horvitz-Thompson 
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estimator: 



- jv s, • (11) 

2^j=i *(dj) 



For computational efficiency, most simulations in this paper were conducted 
with M = 500 and r = 3, with good results. In general, we recommend at least 
M = 2000 and r = 3, and we have used these parameters for the simulations of the 
standard error procedure in the supplemental materials, in the application to real 
data, and in the extension to > 1000 in the discussion. Estimation time scales 
with sample size, population size, and M. In our simulations, with = 1000 and 
M = 500, estimates require about 1.5 seconds on a personal computer, increasing 
to about 6 seconds when M = 2000. In practice, these parameters can be adjusted 
for desired precision in the solution to Higher values of M are particularly 
helpful for more dispersed degree distributions and larger population sizes. 

Simulations have shown the results of this procedure to be at least as good as 



those provided by a first order asymptotic approximation provided by Andreatta 



and Kaufman ( 1986[ ). This approach is novel and its theoretical properties are not 



well understood. It is appealing is to consider the algorithm as a variant of an 



EM algorithm, such as an ECM algorithm (Meng and Rubin 1993), with the un- 
observed part of the degree sequence, dn+i, . . .(In as the latent variable. Unfortu- 
nately, in the design-based frame, these values also fully determine the unknown 



parameters Nk = J2iLi K^i = k), resulting in a degenerate likelihood form. 
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4.1 Between Infinite Population and Full Population Sample 

Consider the limiting case of the moment equation (|7]) where = n. In this case, 
this equation is only satisfied for the degree distribution given by, = v^, result- 
ing in 7r(dj ) = IV j, and ftss = A/ the sample mean. 

Now consider the limit as N oo, for fixed n and fixed maximum degree. 
Then the step-wise selection probabilities for an unsampled node approach values 
proportional to degree: 

(12) 



where pj = ^^'^ . Then P(Binom(n, pi) > 1) — )■ 0, such that 

^ ^ ili = -1, (13) 
77j Pj dj 

for overall inclusion probabilities tt, and step-wise selection probabilities p. There- 
fore, flss AVH- 

In either limit, N ^ n or N ^ oo, fiss approaches an existing estimator. Thus, 
it retains the professed limiting properties of /ivH/ such as robustness to bias based 
on the initial sample, while retaining the favorable finite population characteristics 
of the sample mean in the case of a large sample fraction. In the next section, 
we use simulation studies to argue that for n < < oo, the proposed estimator 
appropriately mediates between these two, and therefore out-performs both. 
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5 Comparing the New and Existing Estimators: 
A Simulation Study 

Our simulation study is designed to highlight the treatment of without-replacement 
sampling in the case of a large sample fraction. To increase the realism of the study, 
we chose parameters to match the characteristics of the pilot data from the CDC 
surveillance program ( Abdul-Quader et al. 2006[ ) wherever possible. The general 



procedure was as follows: 

• 1000 networks are simulated under each test condition. 

• An RDS sample is simulated from each sampled network. 

• RDS estimators are computed from each sample. 

Because the CDC's surveillance system aims for a sample size of 500, and many 



RDS studies approach exhaustion of their populations of interest (CDC 2008 John 



ston 



2009 1, we fix all sample sizes at 500. We also consider a mean degree of 7, close 



to the mean of the pilot data from the CDC study ( Abdul-Quader et al.[|2006 ). 



We assign a discoverable class to each member of the simulated population. 
In reference to studies designed to estimate the prevalence of infectious disease, 
we refer to this characteristic as "infection status," assigning the "infected" status 
(zj = 1) to 20% of simulated population members in each simulation. Note that 
flss could also be applied to a continuous or categorical variable. 

We consider networks sampled from models from the Exponential-family Ran- 



dom Graph Model (ERGM) class (Snijders et al. 2006). Here the relations y are rep 



resented as a realization of the random variable Y with distribution: 



= y|x) = exp{r7-g(y,x) - K(r7,x)} yey, (14) 
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where x are nodal or dyadic covariates, g(y,x) is a vector of network statis- 
tics, T) e W is the parameter vector, 3^ is the set of all possible undirected graphs. 



and exp{K(77,x)} = ^^^-^ exp{r7-g(u, x)} is the normalizing constant (Bamdorff- 



Nielsen 1978[ ). The structure of the networks represented is determined by the 



choice of g(y, x). 

In this study, we vary the structure of the networks in three ways: First, we 
consider populations of sizes (i.e. numbers of nodes) 1000, 835, 715, 625, 555, and 
525, such that samples of size 500 constitute about 50%, 60%, 70%, 80%, 90%, and 
95% of the target population. We focus on this range of sample fractions because 
it highlights settings in which we might expect to see finite population biases such 
as those fiss is intended to address. Another critical feature is the activity ratio 
w, equal to the ratio of the mean degree of infected nodes to the mean degree of 
uninfected nodes. 



This measures the differential tendency for the groups to be socially connected in 
the population. We consider w e {0.5, 0.8, 1, 1.1, 1.4, 1.8, 2.5, 3}. 

We also induce homophily on infection status in these simulations, parameter- 
ized as the relative probability of an edge between two infected nodes, and an edge 
between an infected and an uninfected node. This is an intuitive parameterization, 
also used in Gile and Handcock ( 2010[ ), but differs from that used in other analy- 



ses of RDS. Except in Section 5.3 the edge probability between the two infected 
nodes is fixed at five times that of the mixed dyad. For w = 1, this, along with 
the 20% infected, implies that an edge between two uninfected nodes is twice as 
likely as an edge in a mixed dyad. We consider several other levels of homophily. 
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and in Section 5.3 we show that this feature has important implications for the bias 
induced by biased initial samples, however for the case of initial samples at ran- 
dom with respect to z, as in most of the simulations presented here, bias was not 
affected by level of homophily, although increasing homophily does increase the 
variance of both jlss and jlyu, and is therefore important to consider in standard 
error estimation. 

These features are represented in the ERGM by choosing network statistics to 
represent the mean degree, the activity ratio w, and homophily (based on using 
the "infected" status as a nodal covariate). These three values were specified us- 
ing the "nodemix" statnet model term, which includes three parameters corre- 
sponding to the three cells of the mixing matrix on infection. The parameter rj was 
chosen so the expected values of the statistics were equal to the values given above 
( van Duijn et al.[ [2009). Samples from the resulting models were taken using the 



statnet R package ( [Handcock etaTI [2003| [2008| | 



The RDS sampling mechanism is again designed to mimic that of the CDC's 
pilot study. Ten initial sample nodes were chosen for each sample, selected se- 
quentially with probability proportional to degree, (i.e. by successive sampling). 
In the sensitivity analysis, we also consider initial sample selection regimes de- 
pendent on z. Subsequent sample waves were selected without-replacement by 
sampling up to two nodes at random from among the un-sampled alters of each 
sampled node. Exactly two alters were sampled whenever possible. This process 
typically resulted in the sampling of four complete waves and part of a fifth wave, 
stopping when a sample size of 500 was attained. 

We augment our basic results with two sub-studies evaluating the sensitivity of 
the estimator jlss to assumptions not required for /ivh: the accuracy of the assumed 
population size N and the dependence on the initial sample. 
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5.1 Results 



The Volz-Heckathorn estimator exhibits substantial bias in cases of non-unity ac- 



tivity ratio, and more so for larger sample fractions. This result was noted in Gile 



and Handcock (2010 ), and is illustrated in Figure 1(a) The bias can be understood 
as follows: in the case of higher mean degree among infected nodes (w > 1), and 
large sample fraction, the higher-degree infected samples will be down-weighted 
proportional to their degrees. The true without-replacement sampling probabili- 
ties are closer to uniform than would be suggested by the proportional-to-degree 
estimates, such that these higher-degree nodes are excessively down-weighted, 
leading to negative bias in the estimated proportion infected. The corresponding 
mappings from degree to sampling weight are illustrated in Figure |2| 
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(a) Bias of fivH 



(b) Bias of fiss 



Figure 1: Bias of the Volz-Heckathorn and Successive Sampling estimators from 
samples of size 500 constituting about 50%, 60%, 70%, 80%, 90%, and 95% of the 
population, for varying activity ratio (w). The same samples were used for both 
estimators. 



The SS estimator, however, is not subject to this type of bias, as illustrated in 
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Figure 2: Estimated mappings from nodal degree to inclusion probability for suc- 
cessive sampling samples of size 500 constituting about 50%, 60%, 70%, 80%, 90%, 
95%, and 100% of the population, for simulated network degree distributions with 
w = 1, along with the proportional mapping assumed by the Volz-Heckathom 
estimator (for 50% sample), indicated by oc degree. Note that given the simulated 
degree distribution, the proportional mapping requires probabilities greater than 
1 to attain the desired sample size. 



Figure 1(b) This is the main contribution of the proposed estimatior. In this plot, 
the bias is negligible, with the exception of the case in which w = 0.5. At least 
two factors, related to the strong homophily and the smaller size of the infected 
group contribute to this exception. First, the homophily-induced dependence im- 
plies the initial sample has greater influence on the final sample than in standard 
successive sampling. As this sample is selected first, its selection probabilities are 
closer to proportional to degree than the final successive sampling inclusion prob- 
abilities, contributing to a resulting sample slightly over-representing high-degree 
uninfected nodes. Furthermore, because the infected nodes have low degrees and 
high homophily, they more often fail to produce both possible recruits. The relative 
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group sizes contribute to the difference in the magnitude of this effect for w < 1 
and w > 1. 

The variance of /tg^ is also consistently lower than that of fiyn, combining with 
the lower bias to yield substantially lower mean squared error, often massively so. 
The mean squared error of jlss is less than that of /ivH for the full parameter space 
of Figure [l] The combination of bias and variance effects is visible in Figures 3(a) 
and[3(b)| 

5.2 Sensitivity to Population Size Estimate 



It is important to note that the performance of flss in Figure 1(b) is dependent 
on knowledge of the true population size A^. This is often unrealistic in practice. 
Therefore, we evaluate the performance of /ts^ in the case of over and under esti- 
mation of A^. In particular, we consider small (Ng) and large (Ni) estimates of N 
given by: 

N.^N-"^, /V, = /V + ^. (16) 

Figure |3] depicts the results of these simulations for the case oi w = 1.4. Each 
sub-plot gives the distribution of the estimator over 1000 samples from each of the 
6 population proportions. The four sub-plots represent /ivh (top left), p-ssiN) (top 
right), fiss{Ns) (bottom left) and fissi^i) (bottom right). 

In this case, the underestimation of results in a small positive bias, due 
to the over-estimation of the curvature of /7r(/c;n, N). A small negative bias is 
also present in the case of Ni, due to the under-estimation of the curvature or 
/7r(/c; n, N). In both cases, the bias induced by inaccurate N is less than the bias 
of /tvH/ and the resulting new estimators clearly out-perform the existing estima- 
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Figure 3: /ivH and flss from samples of size 500 constituting about 50%, 60%, 70%, 
80%, 90%, and 95% of the population. Initial samples selected independent of in- 
fection status z. Activity ratio (w) fixed at 1.4. The first plots depict /ivH/ and jlss- 
The last two depict jlss when the number of nodes is under and over estimated, 
respectively. The true value is 0.20. The same samples were used for each estima- 
tor. 

tor. 

Figure |4] presents mean point estimates for other values of activity ratio w. The 
middle set of plots, w = 0.8, corresponds to a case of moderately lower mean 
degree among infected nodes. In this case, the biases introduced by under and over 
estimation of N change sign, but remain small in magnitude. The remaining plots 
illustrate the greater bias possible for extreme values of w. While the biases become 
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larger, they are generally smaller than those exhibited by (Iyr, with the exception 
for the two cases where the negative bias of flss due to small w, as described in 
is compounded by negative bias due to iV < N (w = 0.5, 0.8, 50% 



Section 



5.1 



sample). The mean squared error of jlss is still smaller than that of /tyn in these 
cases. Although the performance of jlss is less robust to N in cases of extreme w, 
/ivH also performs more poorly in these cases. 
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Figure 4: Mean prevalence estimates of four types: /tvH (solid circles), jlss (circles), 
jlssiNs < N) (down-triangles), and jlssi^i > N) (up-triangles), considered for 
50%, 70%, and 90% samples, and differential activity w = 0.5, 0.8, 2.5. The true 
value is 0.20. 

The behavior we see in these plots is typical of other simulations not shown 
here. With the more extreme values of activity ratio w, the bias induced by the 
inaccurate estimation of N is increased, but typically not so much as to cause per- 
formance of jlss worse than that of jlyu- 
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5.3 Sensitivity to Initial Sample and Homophily 

The estimator /tvH requires sufficiently many sample waves to overcome the ef- 
fect of an initial convenience sample and step-wise sampling probabilities not pro- 
portional to degree due to the particular network structure. The estimator fiss, 
not based on a stationary distribution, does not allow for such an argument. We 



therefore argue only that it is not worse than /xvh in this respect. In Section 5.1 
we illustrate that in the simulated networks, bias induced by deviations from the 
configuration model are is no worse for fiss than for /ivh- We focus here on bias 



induced by the initial sample. Gile and Handcock| ( 2010[ ) show that /ivH exhibits 



considerable bias in the case of a biased initial sample and network homophily. We 
expect /t55 to exhibit similar sensitivity, and here compare its performance to that 
of /ivH- 

We consider three regimes for the selection of the initial sample: all uninfected, 
random with respect to infection, and all infected. We also consider five levels of 
homophily, measured by the ratio R defined by: 

^ Probability of an "infected-infected" tie ^ 1 2 3 5 13 (17) 
Probability of an "infected-uninfected" tie ' 

The standard level of homophily used in this paper corresponds to i? = 5. To 
present a comparison most favorable to /tvH/ we treat a population size of 1000, (a 
50% sample), and activity ratio w = 1. 

We consider 1000 samples from each homophily and sampling scenario, and 
summarize performance in terms of absolute bias, variance, and mean squared 
error, the last of which is depicted in Table [l| We find that the variance of /ivh 
always exceeds that of (xssr as does bias in all but a few cases of small differences. 
The MSB of /ivH always exceeds that of fiss- Table [l] illustrates this relation for 
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the case of an all-infected initial sample. The patterns for the other two sampling 
conditions are similar. 



Table 1: Comparison of mean squared error for /ivH arid fiss with all infected initial 
sample and five levels of homophily. MSE{flss) < MSE{ft,YH) for all conditions. 



Homophily (R) 


1 


2 


3 


5 


13 


MSE flyu 

MSE fiss 


0.00029 
0.00026 


0.00036 
0.00032 


0.00054 
0.00052 


0.00150 
0.00140 


0.01525 
0.01449 


Efficiency: MSE /xvh/MSE fiss 


1.12 


1.12 


1.04 


1.07 


1.05 



6 Application to HIV Prevalence in High-Risk 
Populations 

6.1 Background 

The United Nations requires countries to measure and monitor key indicators re 
lated to their HIV epidemics ( |UNAIDS[ [2008| [UNAIDS and World Health Orga 
nization 2007[ ). In particular, countries with epidemics concentrated in high-risk 
groups are required to report on several features of key populations such as (in- 
jecting) drug users ((I)DU), men who have sex with men (MSM), and sex work- 
ers (SW). Such features include HIV prevalence, risk behaviors, and population 
sizes. Because these populations are typically hard-to-reach, many countries rely 
on respondent-driven sampling to estimate HIV prevalence and risk behaviors. In 
this section, we evaluate data collected on IDU and MSM in cities from two differ- 
ent countries in 2007 and 2008. 
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Figure 5: Estimated HIV prevalence in three populations according to jlss, Avh/ 
and the sample mean, for various population size estimates. Dotted lines rep- 
resent 1 standard error above and below figsr according to the estimator in the 
supplemental materials. Vertical bars on (a) represent exogenously estimated pop- 
ulation size and on (c) represent rule-of-thumb 1% to 3% MSM in the population. 
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6.2 Injecting Drug Users in an East European City 

The first example is based on a 2007 survey of IDU in major cities in a former Soviet 
block country The HIV epidemic in this country is largely driven by IDU, where 
prevalence levels are over 50% in many cities, and RDS is used to study this pop- 
ulation, as well as MSM and SW in several larger cities. This country also invests 
heavily in producing estimates of the size of the hidden population through scale- 
up and multiplier methods ( [Kruglov et ar|[2008|p^AIDS/WHO|[2003l ). We focus 



here on the results for one city Because in very large populations we expect jlss 
to be nearly identical to fiyn, we consider a city with one of the smaller estimates 
for the size of the hidden population. In this city, the number of IDU is estimated 
at 1200, with confidence interval 1100-1400. The RDS sample began with 6 initial 
samples, and distributed 3 coupons per respondent. The sample size was 175, all 
with full degree and HIV data. The two longest sample chains ended at wave 9, 
with 4 total respondents from wave 9. 

We estimate the HIV prevalence using flss for several population sizes includ- 
ing 1100, 1200, and 1400, and also report the corresponding estimates of /tvH and 



the sample mean. These results are summarized in Figure 5(a) We find that the 
prevalence estimate based on flss is very close to that based on /ivh for population 
sizes in this range, and well within the uncertainty of the estimate, according to 
the estimator in the supplemental materials. This pattern is consistent across the 
cities we have considered in this country. This is partly due to the policy in this 
country of using RDS only in cities with larger estimated population sizes, opt- 
ing for strategies of institutional sampling or attempted complete enumeration in 
areas with smaller populations. 
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6.3 Drug Users in a Caribbean City 

The second example is taken from a small Caribbean country, which has conducted 
RDS studies of drug users (DU), MSM, and SW in four main cities in 2008 (note that 
this study did not limit participation to injecting drug users). We focus here on the 
study of DU in one of the cities with smaller total population. In this study, there 
were 7 initial samples, resulting in a sample of size 301, of which we include here 
only the 285 with full degree and HIV information. Again three coupons were 
distributed to most respondents, although this number was reduced as the sample 
size approached the target of 300. The two longest sample chains reached wave 11, 
with 6 respondents from this wave. 

In this city, the number of DU is unknown. Therefore, we use the successive 
sampling estimator to produce a sensitivity analysis for the effect of population 
size on the HIV prevalence estimate. The results of this analysis are shown in Fig- 
ure 5(b) Here the point estimate varies from 6.2% for a population size of 301 to 
9.6% for a population size of 6000, with /tvH = 9.7%. In absolute and relative terms, 
this is more variation than in the previous example, (3.4% or about 15% of jlyu)- 
Although these differences are still well within the uncertainty of the estimator, 
in many cases RDS point estimates are important in themselves. It is also possi- 
ble that a sensitivity analysis such as this one will be of interest with respect to a 
particular prevalence threshold, such at the 5% threshold on which UN AIDS bases 
national epidemic classifications. In this example, the point estimate of prevalence 
is above 5% for all population sizes, while a nominal 90% confidence interval in- 
cludes 5% for all population sizes. 
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6.4 Men who have Sex with Men in a Caribbean City 

Finally, we consider a population of MSM in the same Caribbean city. This study 
design was very similar to the corresponding DU study, with 7 initial samples, 
and a maximum of 3 coupons distributed. Here, the sample size was 270, with 
complete information for 269. One chain in this sample reached wave 12, with two 
respondents in that wave. 

Figure 5(c)| illustrates the sensitivity of the prevalence estimate to the number of 



MSM in this city. For comparison, the horizontal axis corresponds to the same pop- 



ulation sizes as in 5(b) although here it is labeled in proportions. The prevalence 
estimate changes by about 1.6 percentage points over this range of possible popu- 
lation sizes (9.5% for 301 or .2% of the population are MSM, to 7.9% for 6000 or 4.2% 
of the population are MSM) . There are no additional studies available to provide 
population size estimates, however this problem is less severe for populations of 
MSM, as a rule-of-thumb is often applied, estimating the number of MSM at about 
l%-3% of the general population, corresponding to the two vertical bars on Fig- 
ure 5(c) In this case, however, further information is available. This sample did 
not reach its desired sample size (300), because no more coupons were returned, 
and the research team was not able to find additional MSM to sample. Thus the 
sample neared exhaustion of the portion of the population available for sampling, 
suggesting that although the population of MSM might be 1-3% of the city popu- 
lation, a smaller portion of those may be connected to the giant component of the 
social network of MSM and willing to participate in such a study. Therefore, if we 
restrict our inference to the reachable portion of the target population, the relevant 
population size is likely much closer to the sample size, suggesting a value of the 
jlss estimate over 1.5 percentage ponts higher than /ivH = 7.8%. Note that among 
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the 12 RDS studies in this country, three samples exhibited this near exhaustion 
behavior. 

It is also of interest to note that in this example, unlike the previous two, jlss 
is consistently higher than /tyn- This result is consistent with the expectation that 
flss is typically between /ivH and the sample mean. In this case, the degree-based 
weights of /ivH reduced the overall weight given to infected nodes, and the more 
moderate weights used in jlss reduced this effect. 



7 Discussion 

The key insight of this paper is the recognition that the true mapping /tt (A;; n, N) 
from nodal degree dj to inclusion probability, tt^ under Respondent-Driven Sam- 
pling is better approximated by successive sampling than by the linear mapping 



assumed by Volz and Heckathom (2008). In addition, we introduce a novel ap 



proach to estimating the unit size (or degree) distribution in the population, based 
on the population size and sizes of observed units under successive sampling. 
Combining this insight and estimation strategy, we present a new estimator for 
population proportions based on an RDS sample. 

The contribution of this new estimator is illustrated in Figure |6] In cases with 
no correlation between degree and quantity of interest {w = 0), and initial sample 
selected at random with respect to infection status, the naive sample mean, /i is an 
unbiased estimator of the population mean, as are existing estimators such as /ivH 
and the proposed estimator jlss- This is illustrated in the first three columns of the 
figure. 

However, when the variable of interest is related to nodal degree, classes with 
higher degrees are over-represented in the population mean, resulting in the posi- 
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tive bias in the sample mean fi in the second three columns of Figure [6j This bias 
shrinks as the sample fraction increases. The existing estimator /tvH adjusts reason- 
ably well for this effect when the sample fraction is small, however for larger sam- 
ple fractions, this estimator over-compensates for the effect of degree distribution, 
resulting in bias opposite that of fi. The contribution of the proposed estimator, 
fiss is that it correctly adjusts for the joint effects of varying degree distributions 
and large sample fractions. 

The last three columns of Figure |6] illustrate a shortcoming of all three of these 
estimators. In this case, all initial samples, or seeds, are selected from among the 
infected nodes. This results in increased positive bias in all three of these estima- 



tors. All three estimators are also subject to other sources of bias discussed in Gile 



and Handcock ( 2010[ ), including bias induced by the systematically biased passing 



of RDS coupons. 

Because RDS is in wide usage, and often in cases where the sample fraction is 
large, this new estimator may improve estimation in many contexts. Its applica- 
bility is limited, however, by the requirement that the population size is known. 
Figures |3] and |4] illustrate that inaccurate estimates of N can introduce bias into 
fissr although this new estimator still out-performs /xvh with the level of inaccu- 
racy considered in this study. 

While most of our simulation study has focused on sample fractions consistent 
with the finite population effects that fiss is intended to address, it is of interest 
to note that fiss is nearly identical to /ivH for small sample fractions. We have 
replicated much of the simulation study here with population size = 10, 000. 
We find no significant differences between fiss and /ivh in these simulations, over 
a full range of values of w. When using inaccurate estimates of as in ( |T6] >, we 
found only one case of significant difference between /ivH and fiss, corresponding 
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Figure 6: Mean prevalence estimates of three types: /tyn (solid circles), fiss (circles), 
and sample mean /t (crosses), considered for 50%, 70%, and 90% samples, and 
activity ratio w = 1,1.8, and for initial sample (seeds) selected at random with 
respect to infection and all infected. None of the estimators exhibit bias in the face 
of homophily with w = 1. When w ^ 1, fiyu and ft exhibit bias, with magnitude 
sensitive to sample fraction. All estimators exhibit bias in the case of biased initial 
sample. 

to a dramatic under-estimate of population size, Ng = 5250, = 10, 000, and for 
high activity ratio, w = 3. In this case, jlss had bias 0.0139 and fxyn had bias 0.0105. 
The difference in mean squared error was not significant. Thus, the proposed es- 
timator is helpful in correcting for finite population biases when they exist, and 
sensitive to the estimated population size in these cases. In cases where finite 
population effects are not present, the proposed estimator nearly coincides with 
existing estimator jlyu, and, unless the inaccurate population size estimate is small 
enough to transition to the region of finite population sensitivity, jlss is insensitive 
to population size estimates in these cases. 

Known population size and random mixing are key assumptions of the es- 
timator jlss- Additional assumptions required by jlss are listed in Table |2] The 
assumptions with strikothrough are necessary for jlyn but not for jiss, and the 
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Table 2: Assumptions of jlss- Assumptions with strikothrough apply to jlyu but 
not jlss- Assumptions in the italic apply to fiss but not /ivH- 





Network Structure 
Assumptions 


Sampling Assumptions 


Random Walk 
Model 


Population size largo (N » n) 


Sampling with roplacomont 
Single non-branching chain 


Remove Initial 

Sample 

Dependence 


Homophily weak enough 
Connected graph 


Sufficiently many sample waves 
Initial sample unbiased 


To Estimate 
Probabilities 


All ties reciprocated 
Known population size N 


Degree accurately measured 
Random referral 



italic text indicates additional assumptions required by jlss but not jlyu- In these 
terms, the key contribution of jlss is to remove the dependence on a known inac- 
curate random walk model for estimating sampling probabilities. In return, jlss 
sacrifices the theoretical robustness to the initial sample promised by the Markov 
chain model, although this robustness was not truly present in /ivH either, and the 
proposed estimator performs no worse than the former in this respect (Table [l]). 
More critically, jlss relies on the assumption of known population size. The esti- 
mator jlss is also sensitive to the degree of non-random mixing, or homophily in 
the network, although no more sensitive than /xvh (Table [T]), as well as to other as- 



sumptions such as accurate self-reported degree and an undirected network. Gile 



and Handcock ( 2010[ ) illustrate the sensitivity of jlyn to deviations from some other 



assumptions in Table |2] We do not expect jlss to be more robust to such deviations 
than /ivH- 

Our application to HIV prevalence estimation in Section |6] illustrates several 
uses for jlss- First, when the population size is known, jlss provides a better esti- 
mate of population proportions than the other available methods. This can also be 
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done when only a range of population sizes is known. In particular, in the case of 
MSM, officials are often willing to assume that the population ranges between 1% 
and 3%. When no information is available on population size, fiss can be used to 
perform a sensitivity analysis. Finally, other information from the sampling pro- 
cess, such as the exhaustion of the population available for sampling, may suggest 
that the sample fraction is large. Additional information may also be gathered by 
asking respondents about their exposure to others who have been sampled. 

Intuition suggests that in the case of a large sample fraction, an RDS estimator 
should negotiate between the infinite-population assumption of the Volz-Heckathorn 
estimator and the full-population sample assumption of the naive sample mean. 
In this paper, we provide an estimator based on a successive sampling model that 
appropriately negotiates between these two extremes. Whenever the hidden pop- 
ulation size is known, it will be preferable to use this estimator. When the hid- 
den population size is not known, this estimator still provides a helpful diagnostic 
check on the other available estimators. 

Beyond the estimator itself, this paper contributes a new theoretical framework 
for understanding the sampling process in respondent-driven sampling. Previ- 
ous understandings have relied on a with-replacement or infinite population as- 
sumption, an assumption known to be inaccurate, and critically so in some popu- 
lations. The introduction of a sampling model appropriately accommodating the 
true without-replacement nature of the sampling process opens new possibilities 
for future research on respondent-driven sampling. We intend to make code avail- 
able for these procedures in the R package RDS on CRAN. 
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